Code
library(DBI)
library(RSQLite)
library(tidyverse)
Tony Duan
This document explains how to connect to and interact with SQL databases from within R. We will cover writing raw SQL queries, using the tidyverse
-friendly dbplyr
package, and other best practices.
The DBI
package provides a universal, consistent interface for connecting to any database. Specific packages like RSQLite
, RPostgres
, or odbc
provide the actual drivers to connect to different database systems.
For these examples, we will use RSQLite
, which creates a lightweight, file-based SQL database.
First, we connect to a database file (it will be created if it doesn’t exist). Then, we write the mtcars
and iris
R data frames into the database as new tables.
# Add car names as a column in mtcars
mtcars_df =cbind(car_name = rownames(mtcars), mtcars)
# Create a connection to an SQLite database file
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")
# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)
We can list the tables to confirm they were created.
The dbGetQuery()
function sends a SQL statement to the database and returns the result as a data frame.
Select all columns from the mtcars
table, limiting the result to the first 3 rows.
Use AS
to rename a column in the output. You can also create new columns based on calculations.
The WHERE
clause filters rows based on a condition.
car_name mpg cyl disp hp drat wt qsec vs am gear carb
1 Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
2 Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
3 Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
4 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
5 Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
6 Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
GROUP BY
collapses multiple rows into a single summary row. It’s used with aggregate functions like AVG()
, COUNT()
, MIN()
, and MAX()
.
This query calculates the average horsepower and row count for each cylinder group.
ORDER BY
sorts the result set in ascending (ASC
) or descending (DESC
) order.
This example creates two tables, car_parts
and part_info
, to demonstrate joins.
Returns only the rows where the join key (part_id
) exists in both tables.
Returns all rows from the left table (car_parts
) and the matched rows from the right table (part_info
).
SQL has functions for manipulating strings and dates. The exact syntax can vary by database. In SQLite, you can use UPPER()
and strftime()
.
UPDATE
modifies existing records in a table.
CREATE TABLE
builds a new table, and DROP TABLE
deletes it.
[1] 0
[1] "car_parts" "iris" "mtcars" "mtcars_copy" "part_info"
dbplyr
dbplyr
is a tidyverse
backend for databases. It allows you to write familiar dplyr
code, which dbplyr
translates into SQL for you. This is the recommended modern approach as it is easier to write and less prone to SQL injection errors.
First, we create a connection to the mtcars
table in the database.
Now, we can use dplyr
verbs on this object. The code is not run in R; it is translated to SQL and sent to the database.
We can use show_query()
to see the SQL that dbplyr
generated.
<SQL>
SELECT `cyl`, AVG(`mpg`) AS `avg_mpg`
FROM (
SELECT `car_name`, `mpg`, `hp`, `cyl`
FROM `mtcars`
WHERE (`hp` > 150.0)
) AS `q01`
GROUP BY `cyl`
ORDER BY `avg_mpg` DESC
To execute the query and pull the results back into an R data frame, we use collect()
.
To prevent SQL injection attacks, you should not use paste()
to insert values into a SQL query. Instead, use parameterized queries with ?
as a placeholder. DBI
will safely insert the values for you.
car_name hp
1 Hornet Sportabout 175
2 Duster 360 245
3 Merc 450SE 180
4 Merc 450SL 180
5 Merc 450SLC 180
6 Cadillac Fleetwood 205
sqldf
on Local Data FramesThe sqldf
package allows you to run SQL queries on local R data frames, not on an external database. This can be useful for users who are more comfortable with SQL than with dplyr
syntax for data manipulation.
It is very important to close the database connection when you are finished to release resources.
---
title: "Working with SQL Databases in R"
author: "Tony Duan"
execute:
warning: false
error: false
format:
html:
toc: true
toc-location: right
code-fold: show
code-tools: true
number-sections: false
code-block-bg: true
code-block-border-left: "#31BAE9"
---
This document explains how to connect to and interact with SQL databases from within R. We will cover writing raw SQL queries, using the `tidyverse`-friendly `dbplyr` package, and other best practices.
{width="600"}
# 1. Connecting to a Database
The `DBI` package provides a universal, consistent interface for connecting to any database. Specific packages like `RSQLite`, `RPostgres`, or `odbc` provide the actual drivers to connect to different database systems.
For these examples, we will use `RSQLite`, which creates a lightweight, file-based SQL database.
```{r}
library(DBI)
library(RSQLite)
library(tidyverse)
```
## Create and Populate a Database
First, we connect to a database file (it will be created if it doesn't exist). Then, we write the `mtcars` and `iris` R data frames into the database as new tables.
```{r}
# Add car names as a column in mtcars
mtcars_df =cbind(car_name = rownames(mtcars), mtcars)
# Create a connection to an SQLite database file
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")
# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)
```
We can list the tables to confirm they were created.
```{r}
dbListTables(con)
```
# 2. Method 1: Writing Raw SQL Queries
The `dbGetQuery()` function sends a SQL statement to the database and returns the result as a data frame.
## SELECT Statement
Select all columns from the `mtcars` table, limiting the result to the first 3 rows.
```{r}
sql_select <- "SELECT * FROM mtcars LIMIT 3"
dbGetQuery(con, sql_select)
```
## Column Aliasing and Creation
Use `AS` to rename a column in the output. You can also create new columns based on calculations.
```{r}
sql_create_col <- "SELECT mpg, mpg + 1 AS mpg_plus_one FROM mtcars"
head(dbGetQuery(con, sql_create_col))
```
## Filtering with WHERE
The `WHERE` clause filters rows based on a condition.
```{r}
sql_filter <- "SELECT * FROM mtcars WHERE hp > 200 AND cyl = 8"
head(dbGetQuery(con, sql_filter))
```
## Aggregation with GROUP BY
`GROUP BY` collapses multiple rows into a single summary row. It's used with aggregate functions like `AVG()`, `COUNT()`, `MIN()`, and `MAX()`.
This query calculates the average horsepower and row count for each cylinder group.
```{r}
sql_groupby <- "SELECT cyl, AVG(hp) AS avg_hp, COUNT(*) AS num_cars FROM mtcars GROUP BY cyl"
dbGetQuery(con, sql_groupby)
```
## Sorting with ORDER BY
`ORDER BY` sorts the result set in ascending (`ASC`) or descending (`DESC`) order.
```{r}
sql_orderby <- "SELECT car_name, mpg, hp FROM mtcars ORDER BY hp DESC LIMIT 5"
dbGetQuery(con, sql_orderby)
```
## Joins
This example creates two tables, `car_parts` and `part_info`, to demonstrate joins.
```{r}
dbWriteTable(con, "car_parts", data.frame(car_name = c("Mazda RX4", "Datsun 710", "Hornet 4 Drive"), part_id = c(1, 2, 3)), overwrite = TRUE)
dbWriteTable(con, "part_info", data.frame(part_id = c(1, 2, 4), part_name = c("Engine", "Tire", "Brake")), overwrite = TRUE)
```
### INNER JOIN
Returns only the rows where the join key (`part_id`) exists in both tables.
```{r}
sql_inner_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
INNER JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_inner_join)
```
### LEFT JOIN
Returns all rows from the left table (`car_parts`) and the matched rows from the right table (`part_info`).
```{r}
sql_left_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
LEFT JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_left_join)
```
## String and Date Functions
SQL has functions for manipulating strings and dates. The exact syntax can vary by database. In SQLite, you can use `UPPER()` and `strftime()`.
```{r}
sql_string_date <- "SELECT car_name, UPPER(car_name) AS upper_name, '2023-10-27' AS today from car_parts"
head(dbGetQuery(con, sql_string_date))
```
## Modifying Data
### UPDATE Table
`UPDATE` modifies existing records in a table.
```{r}
# Note: dbExecute is used for statements that don't return data
dbExecute(con, "UPDATE mtcars SET hp = 120 WHERE car_name = 'Mazda RX4'")
dbGetQuery(con, "SELECT car_name, hp FROM mtcars WHERE car_name = 'Mazda RX4'")
```
### CREATE and DROP Table
`CREATE TABLE` builds a new table, and `DROP TABLE` deletes it.
```{r}
sql_create <- "CREATE TABLE mtcars_copy AS SELECT * FROM mtcars"
dbExecute(con, sql_create)
dbListTables(con)
```
```{r}
sql_drop <- "DROP TABLE mtcars_copy"
dbExecute(con, sql_drop)
dbListTables(con)
```
# 3. Method 2: Using `dbplyr`
`dbplyr` is a `tidyverse` backend for databases. It allows you to write familiar `dplyr` code, which `dbplyr` translates into SQL for you. This is the recommended modern approach as it is easier to write and less prone to SQL injection errors.
First, we create a connection to the `mtcars` table in the database.
```{r}
mtcars_db <- tbl(con, "mtcars")
```
Now, we can use `dplyr` verbs on this object. The code is not run in R; it is translated to SQL and sent to the database.
```{r}
query <- mtcars_db %>%
select(car_name, mpg, hp, cyl) %>%
filter(hp > 150) %>%
group_by(cyl) %>%
summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
arrange(desc(avg_mpg))
```
We can use `show_query()` to see the SQL that `dbplyr` generated.
```{r}
show_query(query)
```
To execute the query and pull the results back into an R data frame, we use `collect()`.
```{r}
results <- collect(query)
results
```
# 4. Best Practice: Parameterized Queries
To prevent SQL injection attacks, you should not use `paste()` to insert values into a SQL query. Instead, use parameterized queries with `?` as a placeholder. `DBI` will safely insert the values for you.
```{r}
hp_threshold <- 150
sql_safe <- "SELECT car_name, hp FROM mtcars WHERE hp > ?"
# dbGetQuery can take a `params` list
dbGetQuery(con, sql_safe, params = list(hp_threshold)) %>% head()
```
# 5. Method 3: Using `sqldf` on Local Data Frames
The `sqldf` package allows you to run SQL queries on local R data frames, not on an external database. This can be useful for users who are more comfortable with SQL than with `dplyr` syntax for data manipulation.
```{r}
library(sqldf)
# Note the use of the R data frame `mtcars_df`
sqldf("SELECT car_name, mpg FROM mtcars_df WHERE cyl = 8 LIMIT 5")
```
# 6. Disconnecting from the Database
It is very important to close the database connection when you are finished to release resources.
```{r}
dbDisconnect(con)
```